Robust Algorithms for Clustering with Applications to Data Integration
Data-based applications are increasingly used for decision-making with far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications, and clustering is the fundamental concept underlying all of these methods. The importance of accurate, robust and scalable clustering methods therefore cannot be overstated. We tackle the various facets of clustering with a multi-pronged approach described below.
1. While identifying clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness of clustering methods that leverage supervision through an oracle, i.e., an abstraction of crowdsourcing. Additionally, we focus on scalability to handle web-scale datasets.
2. In community detection applications, a common setback in evaluation of the quality of clustering techniques is the lack of ground truth data. We propose a generative model that considers dependent edge formation and devise techniques for efficient cluster recovery
Connectivity of Random Annulus Graphs and the Geometric Block Model
We provide new connectivity results for vertex-random graphs or
random annulus graphs, which are significant generalizations of random
geometric graphs. Random geometric graphs (RGG) are one of the most basic
models of random graphs for spatial networks, proposed by Gilbert in 1961,
shortly after the introduction of the Erdős-Rényi random graphs. They
resemble social networks in many ways (e.g. by spontaneously creating clusters
of nodes with high modularity). The connectivity properties of RGGs have been
studied since their introduction, and analyzing them has been significantly
harder than for their Erdős-Rényi counterparts due to correlated edge
formation.
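To make the model concrete, the following is a minimal sketch of sampling a random annulus graph and checking whether it is connected. The abstract does not give the definition, so the one used here is an assumption: points are drawn uniformly on a circle of circumference 1, and two points are joined when their circular distance lies in the interval [r_inner, r_outer]. The function names and parameters are hypothetical, chosen only for illustration.

```python
import random
from collections import deque


def circular_distance(x, y):
    """Distance between two points on a circle of circumference 1."""
    d = abs(x - y)
    return min(d, 1.0 - d)


def random_annulus_graph(n, r_inner, r_outer, seed=None):
    """Sample n uniform points on the circle and connect two points when
    their distance lies in [r_inner, r_outer] (assumed definition).
    With r_inner = 0 this reduces to a one-dimensional random geometric graph."""
    rng = random.Random(seed)
    points = [rng.random() for _ in range(n)]
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if r_inner <= circular_distance(points[i], points[j]) <= r_outer:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return points, adjacency


def is_connected(adjacency):
    """Breadth-first search from vertex 0; True if every vertex is reached."""
    if not adjacency:
        return True
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adjacency)
```

Because an edge depends on the positions of both endpoints, edges that share a vertex are not independent, which is the correlation the abstract refers to. Repeating the sampling for several values of r_outer gives an empirical feel for where connectivity sets in, though it says nothing about the thresholds proved in the paper.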
Our next contribution is in using the connectivity of random annulus graphs
to provide necessary and sufficient conditions for efficient recovery of
communities for the geometric block model (GBM). The GBM is a
probabilistic model for community detection defined over an RGG in a similar
spirit as the popular stochastic block model, which is defined over an
Erdős-Rényi random graph. The geometric block model inherits the
transitivity properties of RGGs and thus models communities better than a
stochastic block model. However, analyzing it requires fresh perspectives, as
all prior tools fail due to correlation in edge formation. We provide a simple
and efficient algorithm that can recover communities in GBM exactly with high
probability in the regime of connectivity
METAM: Goal-Oriented Data Discovery
Data is a central component of machine learning and causal inference tasks.
The availability of large amounts of data from sources such as open data
repositories, data lakes and data marketplaces creates an opportunity to
augment data and boost those tasks' performance. However, augmentation
techniques rely on a user manually discovering and shortlisting useful
candidate augmentations. Existing solutions do not leverage the synergy between
discovery and augmentation, thus underexploiting data.
In this paper, we introduce METAM, a novel goal-oriented framework that
queries the downstream task with a candidate dataset, forming a feedback loop
that automatically steers the discovery and augmentation process. To select
candidates efficiently, METAM leverages properties of the: i) data, ii) utility
function, and iii) solution set size. We show METAM's theoretical guarantees
and demonstrate those empirically on a broad set of tasks. All in all, we
demonstrate the promise of goal-oriented data discovery to modern data science
applications.
Comment: ICDE 2023 paper
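The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not METAM's algorithm: the abstract does not specify how candidates are selected, so the sketch substitutes a plain greedy strategy that queries a utility function (standing in for the downstream task) once per remaining candidate and keeps the best one. All names, the toy datasets and the toy utility are hypothetical.

```python
def greedy_goal_oriented_augmentation(candidates, utility, max_size, min_gain=0.0):
    """Greedily pick augmentations, using the downstream task as feedback.

    candidates: list of candidate augmentation identifiers
    utility:    function mapping a list of chosen candidates to a score
    max_size:   upper bound on the number of augmentations to select
    min_gain:   stop when the best candidate improves the score by no more than this
    """
    selected = []
    best = utility(selected)
    remaining = list(candidates)
    while remaining and len(selected) < max_size:
        # Query the downstream task once per remaining candidate.
        scored = [(utility(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(scored, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break  # no candidate helps enough; stop early
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best = top_score
    return selected, best


# Toy stand-in for a downstream task: the fraction of useful features covered.
USEFUL = {"age", "income", "zip"}
FEATURES = {
    "census": {"age", "income"},
    "weather": {"rain"},
    "postal": {"zip"},
    "census_copy": {"age"},
}


def toy_utility(chosen):
    covered = set().union(*(FEATURES[name] for name in chosen))
    return len(covered & USEFUL) / len(USEFUL)
```

Calling `greedy_goal_oriented_augmentation(list(FEATURES), toy_utility, max_size=3)` selects `census` and then `postal`, and stops once no remaining candidate improves the score. The sketch issues one utility query per candidate per round, which is costly when each query means retraining a model; the abstract says METAM selects candidates efficiently by leveraging properties of the data, the utility function and the solution set size.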